At-least-once guarantees and repeatability

Reaqtorの信頼性モデルは、コンピュートノードが故障した場合に、入口イベントの再生による処理の繰り返しの可能性にソフトな上限を設けて、at-least-onceイベント処理を保証しています（チェックポインティングインターバルに時間的制約があります）。

Reaqtor’s reliability model offers at-least-once event processing guarantees with a soft upper bound to possible repetition of processing due to replay of ingress events in case of a compute node failure (time-bound to a checkpointing interval).

基本的な考え方は、イベント処理コードパス上でI/Oを発生させることなくイベントを処理し、計算状態の変更を永続化することで、システムのレイテンシーを低減することです。定期的なチェックポイントは、すべての状態変化を要約し、これらを永続的なストレージに確実に保存するために使用され、ノードが故障した場合の回復を可能にします。これらのチェックポイントには、すべてのリアクティブアーティファクトの（不変の）定義や、サブスクリプション内のすべてのクエリ演算子の実行時の状態（例えば、平均的なアグリゲーション演算子の実行中の合計とカウント）が含まれます。

The basic idea is process events without incurring any I/O on the event processing code path to persist changes to computational state, thus reducing the latency of the system. Periodic checkpoints are used to summarize all the state changes and reliably persist these to permanent storage, allowing for recovery in case of a node failure. These checkpoints include the (immutable) definitions of all reactive artifacts as well as the runtime state of all query operators in subscriptions (e.g. the running sum and count for an average aggregation operator).

チェックポイントは、ポーズ・ザ・ワールド・スケジューラ機構を用いて実装されており、イベントが流れていない間に、すべてのリアクティブアーティファクトの状態の一貫したスナップショットを取ることができます。インメモリのスナップショットが取得されると（通常、数ミリ秒で完了します）、イベントの流れが回復し、状態が複製された状態ストアに永続化されます。状態の永続化が成功すると（レプリケーションのために何秒もかかることがあります）、アーティファクトが正常に保存されたとマークされます。ただし、これらのアーティファクトは、再開された計算のために、その間に再び汚されている可能性があります。

Checkpoints are implemented using a pause-the-world scheduler mechanism, allowing to take a consistent snapshot of the state of all reactive artifacts while no events are flowing. After an in-memory snapshot is taken (which typically takes a few milliseconds), the flow of events is restored, and the state is persisted to a replicated state store. When the state is successfully persisted (which may take many seconds due to replication), artifacts get marked as successfully saved. Note that these artifacts may have been dirtied again in the meantime due to the resumed computation.

このチェックポイント処理は、一定の間隔で定期的に繰り返されます（ただし、ダーティな状態の量を推定し、チェックポイント間隔の上限を設定することで、自動的なチェックポイント処理をサポートするように拡張することも可能です）。クエリエンジンは、完全チェックポイントと差分チェックポイントをサポートしているため、各チェックポイントに書き込まれるデータ量は、状態変化の差分のみとなります。

This checkpointing process repeats periodically at fixed intervals (though it could be augmented to support automatic checkpointing based on an estimation of the amount of dirty state and a configurable upper limit to checkpoint intervals). The query engine supports full and differential checkpoints, thus reducing the data volume written each checkpoint to only the state change deltas.

別の実装方法も試作されていますが、ここでは、イベントの再生シーケンスが過度に長くならないようにしながら、ダーティな状態が十分に発生したときにエンジンが自分自身をチェックポイントするようになっています。これは、ガベージコレクタがメモリ管理操作を行うかどうかを決定するのに似ています。実際、GCの他のテクニック（世代に割り当ててアーティファクトをエージングする、状態空間を分割するために、頻繁に編集されるノードとほとんど変更されないノードを追跡するなど）も実験されています。これは将来的に革新的な分野になるかもしれません。

Alternative implementations have been prototyped, where an engine checkpoints itself when there’s sufficient dirty state, while avoiding excessively long replay sequences for events. This is similar to a garbage collector deciding if/when to perform memory management operations. In fact, other techniques from GC (such as aging artifacts by assigning them to generations, or keeping track of nodes with high frequency edits, versus rarely changing ones to partition the state space) have been experimented with as well. This could be an area of future innovcation.

これらのチェックポイントの一部として、エンジンを含む外部からイベントを受信するすべてのアーティファクト（observablesやsubjectなど）は、フェイルオーバー時にイベントを再生できるようにするためのシーケンス識別子を永続化します。チェックポイントの永続化に成功した後、これらのアーティファクトは、必要に応じて外部の再生履歴を刈り取ったり、トリミングしたりすることができます。Reaqtorが提供するat-least-once処理保証につながるのは、このフェイルオーバー時のイベントの再生です。クエリ演算子が反復的または収束的な動作をする場合、エグレス側で重複排除を行うことで、フェイルオーバー時に再生された結果として生じる重複イベントをフィルタリングすることができます。

As part of these checkpoints, all artifacts that receive events from outside the containing engine (e.g. observables or subjects) persist sequence identifiers that allow them to replay events in case of a failover. After successfully persisting a checkpoint, these artifacts can prune or trim external replay history, if they wish to do so. It is this replay of events in case of failover that leads to the at-least-once processing guarantee provided by Reaqtor. In case query operators have repeatable or convergent behavior, de-duplication on the egress side can be used to filter out duplicate events that resulted from replay during a failover.

例えば、1,000イベント/秒のストリームで1分ごとにチェックポイントを行う場合、最低でも直近の60,000イベントを保持する必要があります（チェックポイントの失敗や欠落は考慮しませんが、実際には保持サイズはこの計算の倍数になります）。

Note that the checkpointing interval dictates a retention size requirement, e.g. for checkpoints every minute on a stream producing 1,000 events/second, one needs at least retention of the last 60,000 events (not accounting for failed or missed checkpoints; in reality the retention size is a multiple of this back of the napkin calculation).

多くのReaqtorの本番環境では、クエリ評価ノードのセカンダリレプリカに状態を複製するメカニズムとして、キー/バリューストアをサポートするService Fabricを使用しています。最近のデプロイメントでは、代わりに信頼性の高いコレクションを使用するようになりました。典型的な構成では、クエリエバリュエータを5つのレプリカを持つステートフルなサービスとして実行するため、チェックポイントの各コミットは、少なくとも3つのレプリカに書き込みのクォーラムをもたらします。サービスファブリックの主な利点は、エンジンのステートストアをプライマリレプリカにコロケーションすることで、多くの小さな読み取りを伴うリカバリを高速に行うことができます。中央のストアを使用するとリカバリ時間が長くなりますが、チェックポイントのトランザクション一貫性が保証されていれば、別のアプローチ（例えばRedis）を使用しても目標を達成することができます。

In many of the Reaqtor production deployments, we use Service Fabric with its key/value store support as the mechanism to replicate state to secondary replicas of query evaluator nodes. More recent deployments have migrated to use reliable collections instead. A typical configuration runs the query evaluators as stateful services with 5 replicas, thus each commit of a checkpoint results in a quorum of writes on at least 3 replicas. A key advantage of Service Fabric is the colocation of an engine’s state store with the pimrary replica, which enables for fast recovery which involves a lot of small reads. Using a central store, recovery times would go up, but alternative approaches could be used to achieve these goals (e.g. Redis), as long as transactional consistency for checkpoints is guaranteed.

初期の段階では、RSL（Replicated State Library）の利用も検討しましたが、Azureの主要サービスと相談した結果、Service Fabricにも賭けてみることにしました。

Early on, we also investigated the use of Replicated State Library (RSL), but after consulting with key services in Azure, we decided to bet on Service Fabric as well.